i) The train dataset has the prediction variable (unit_sales) in addition to the date, id, store number, onpromotion and item number.
ii) The dependent variable unit_sales can be an integer, a float or a negative number (returned products).
iii) It is evident that onpromotion has NaN values. We will deep dive into removing them in a later part of this section.
iv) Store number denotes the store in which the sale happened.
v) Item number denotes the item that was sold.
vi) Unit sales denotes the number of units sold on that particular day.
vii) onpromotion denotes whether the product was on promotion or not.
viii) id is the unique identifier field.
Holiday dataset shows whether each day is a work day or holiday.
Oil dataset has oil price for each day.
i) The stores dataset has the store number, city and state where the store is located.
ii) Cluster is used to group similar stores.
Sales are slightly lower at the month end and on the 1st day of the month.
There are 54 stores in total; store number 44 has the highest sales whereas store number 52 has the lowest.
There are a few outliers in our dependent variable.
Most of the products are not on promotion and there are some missing values in this field as well.
The biggest yearly periodic spike in transactions occurs at the end of the year, in December. Perhaps this is due to some sort of Christmas sale/discount that Corporacion Favorita holds every December.
Perishable items are less in quantity compared to the non-perishable ones.
One cluster stands out as the largest, grouping the most stores.
Quito has the largest number of stores, consisting of store types A, B, C and D. Guayaquil comes next and also has all store types. Guayaquil and Quito are the two cities that stand out in terms of the range of retail types available. This is unsurprising given that Quito is Ecuador's capital and Guayaquil is the country's largest and most populous city. As a result, one might expect Corporacion Favorita to target these major cities with the most diverse store types, as evidenced by the highest counts of store numbers attributed to those two cities.
Ecuador has all types of holidays, whereas the other locales have either a holiday or an additional day off.
The 125 million records have been reduced to 86 million.
We can see that the count has nearly halved after filtering the data from the original. Let us segregate the other datasets in the same way.
To minimise the number of rows, we are going to filter the dataframe on a condition: we keep only the item family EGGS.
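The filtering step can be sketched as below. This is a minimal, self-contained example: the toy frame and the column names (family, unit_sales) stand in for the merged train data described above.

```python
import pandas as pd

# Toy frame standing in for the merged train data; the column names
# here are assumptions based on the description in the text.
train_merged = pd.DataFrame({
    "item_nbr": [101, 102, 103, 104],
    "family": ["EGGS", "DAIRY", "EGGS", "BREAD"],
    "unit_sales": [5.0, 2.0, 7.0, 1.0],
})

# Keep only the rows belonging to the EGGS family.
train_subset = train_merged[train_merged["family"] == "EGGS"].copy()
print(len(train_subset))  # 2 of the 4 toy rows survive the filter
```

The `.copy()` avoids pandas' SettingWithCopy warnings when the subset is modified later.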
The 86 million records have been reduced to roughly 900,000. We will further filter down to a single item.
Item number 208384 has the highest count and item 1974848 has the lowest count.
Converting the raw timestamps into a time series format is very important. Let's do some EDA on our input dataframe named train_subset.
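The conversion amounts to parsing the date column and making it the (sorted) index, so pandas treats the frame as a time series. A minimal sketch, with toy values standing in for train_subset:

```python
import pandas as pd

# Toy stand-in for train_subset; parse the date column and promote it
# to a sorted DatetimeIndex so time-series operations (resampling,
# rolling windows, decomposition) work as expected.
train_subset = pd.DataFrame({
    "date": ["2017-01-03", "2017-01-01", "2017-01-02"],
    "unit_sales": [9.0, 10.0, 12.0],
})
train_subset["date"] = pd.to_datetime(train_subset["date"])
train_subset = train_subset.set_index("date").sort_index()
print(train_subset.index.dtype)  # datetime64[ns]
```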
Density Plot
Handling missing values in a time series is different from normal data: we cannot drop rows, because that would disturb the order of the series, and we cannot impute with the global mean or median, because that distorts the trend and seasonality components.
We can also fill a missing value with the mean of the previous two or three values (a local mean, not the global mean). One way to do this: train_subset['oil_price'] = train_subset['oil_price'].fillna(train_subset['oil_price'].rolling(window=2, min_periods=1).mean())
Outlier analysis identifies the data points or observations that deviate from the normal behaviour of the dataset; such points can indicate a critical incident in the business that needs to be checked. Two types:
Global outlier - a point that lies completely outside and far away from the rest of the data; it is visible to the naked eye. It may have been recorded by mistake, or it may be genuine.
Contextual outlier - specific to time series data; the value lies within the overall range of the series but deviates from the local trend or seasonal pattern.
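Global outliers can be flagged with the standard IQR rule; a minimal sketch on toy sales values (the 1.5 multiplier is a common convention, not a fixed law):

```python
import pandas as pd

# Toy daily sales with one obvious global outlier.
sales = pd.Series([10, 12, 11, 13, 12, 300, 11, 10], dtype=float)

# IQR rule: flag points beyond 1.5 * IQR from the quartiles.
q1, q3 = sales.quantile(0.25), sales.quantile(0.75)
iqr = q3 - q1
outliers = sales[(sales < q1 - 1.5 * iqr) | (sales > q3 + 1.5 * iqr)]
print(outliers.tolist())  # [300.0]
```

Contextual outliers need the local context; one common approach is to compare each point against a rolling mean and flag large deviations relative to a rolling standard deviation.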
The outlier on 20th June 2017 may be explained by a solstice celebration festival in Ecuador around that date.
Cluster and oil_price form the only pair with a correlation of 0.20; all the other pairs have lower correlations.
A time series has four main components: trend, seasonality, cyclic behaviour and residuals. (Stationarity is a property of the series rather than a component: a series is stationary once trend and seasonality are removed.)
Decomposition is a statistical technique that separates the trend, seasonality, noise and cyclic behaviour from the time series data; these are the time series components. To proceed with forecasting we want only the stationary part of the data. Now, let us see whether we have any of those components, and if we do, we can remove them using the appropriate techniques in order to forecast better. Note: after removing the components, whatever remains is called the residuals. Decomposition should be additive if the components combine linearly (constant seasonal amplitude) and multiplicative if the seasonal amplitude grows with the level of the series.
There is some upward trend, no clear seasonality, and some residuals. Handling trend: for a multiplicative model we divide the observed value by the trend; this is called detrending. pd.DataFrame(res.observed / res.trend).plot() plots the series with the trend removed. For an additive model we instead subtract the trend from the observed value. Note: the additive model is applied to linear data and the multiplicative model to non-linear data.
The NaN values appear because the lagged columns (t-1 and t-2) have no earlier observations to draw values from.
This result is the same as observed[1].
Non-stationary: go for the ARIMA model (the I component makes the series stationary via differencing; a typical value of d is 1 or 2). Stationary (no trend and no seasonality): the ARMA model. If the series also has seasonality, we can go for the SARIMA model.
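Differencing itself is a one-liner in pandas; a minimal sketch on a toy trending series:

```python
import pandas as pd

# First-order differencing (the "I" in ARIMA): subtract the previous
# observation so a trending series becomes closer to stationary.
s = pd.Series([10.0, 12.0, 15.0, 19.0, 24.0])
diff1 = s.diff()         # d = 1
diff2 = s.diff().diff()  # d = 2, if one round is not enough
print(diff1.tolist())  # [nan, 2.0, 3.0, 4.0, 5.0]
```

Here one difference turns the quadratic-like growth into a linear sequence, and a second difference makes it constant.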
In the ADF test, the null hypothesis states that the series is not stationary (it possesses a unit root); the alternative hypothesis is that the series is stationary. A p-value less than 0.05 means we reject the null in favour of the alternative: the series is stationary.
In the KPSS test, the null hypothesis states that the series is stationary; the alternative hypothesis is that it is not stationary. A p-value less than 0.05 means we reject the null in favour of the alternative: the series is not stationary.
The confidence interval is 80% by default, so we pass 95% explicitly.